feat(engine): wire extensions and capabilities into runtime pipeline#2860
Open
gouslu wants to merge 24 commits into
Open
feat(engine): wire extensions and capabilities into runtime pipeline#2860gouslu wants to merge 24 commits into
gouslu wants to merge 24 commits into
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extension and node URNs now share one canonical 4-segment shape `urn:<namespace>:<kind>:<id>` (e.g., `urn:microsoft:extension:azure_identity_auth`), mirroring the receiver/processor/exporter convention. Short forms `<kind>:<id>` work for both, expanding to `urn:otel:<kind>:<id>`. The previous 3-segment extension form (`urn:<ns>:<id>`) is no longer accepted. Misplacement errors include actionable hints, e.g. '(declare under `extensions:` instead of `nodes:`)'. - New private `crates/config/src/urn.rs` factors out the shared parser core (`parse_kinded_urn`, `build_canonical_urn`, segment validators, misplacement-hint formatter). - `node_urn.rs` and `extension_urn.rs` now delegate to it; their accepted-kind sets are disjoint, so `NodeUrn` can never parse an extension URN and vice versa. - Removed `NodeKind::Extension` from the node enum and all defensive match arms in `pipeline.rs`, `controller/lib.rs`, `controller/startup.rs`, `engine/lib.rs`. - Removed `Error::ExtensionInNodesSection` (became unreachable — type-rejected at parse). - Updated YAML fixtures, bundled `configs/fake-with-extension.yaml`, and all `extension_e2e.rs` test URN constants to the 4-segment form. - Added regression tests for misplacement-hint error messages. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2860 +/- ##
==========================================
- Coverage 86.26% 86.09% -0.18%
==========================================
Files 715 722 +7
Lines 272123 273773 +1650
==========================================
+ Hits 234751 235708 +957
- Misses 36848 37541 +693
Partials 524 524
🚀 New features to boost your workflow:
|
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validation
The file contains Jinja placeholders ({{core_start}}, {{core_end}}) but
was missing the .j2 suffix, so the validate-configs CI script picked it
up as a real config and failed YAML parsing. Sibling otlp-otlp.yaml.j2
and otlphttp-otlphttp.yaml.j2 already use the .j2 convention; align
otap-otap with them and update the three suite references.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jmacd
approved these changes
May 12, 2026
Contributor
jmacd
left a comment
There was a problem hiding this comment.
The change from Box<> to Rc<> occupies most of this PR, makes sense.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
| #[test] | ||
| fn accepts_ref_self_with_lifetime() { | ||
| assert!(validate("trait Cap { fn get<'a>(&'a self) -> &'a str; }").is_none()); | ||
| } |
Member
There was a problem hiding this comment.
Can we add a test for method type generics too? Since the generated handles are Box<dyn ...>, I’m not sure fn get<T>(&self, key: T) is actually dyn-compatible.
lalitb
reviewed
May 12, 2026
lalitb
reviewed
May 12, 2026
Address review feedback on PR open-telemetry#2860: - Reject method-level generic type and const parameters in validate_trait with a clear syn::Error. The macro generates Box<dyn local/shared::Trait> handles, and dyn-compatibility forbids generic type/const params on dispatchable methods. Lifetimes are unaffected and remain accepted. Previously the trait would expand and only fail later with a deep E0038 error. Refs: open-telemetry#2860 (comment) - Reject destructured argument patterns (e.g. `(a, b): (u64, u64)`) in validate_trait with a syn::Error instead of a proc-macro panic deep in adapter codegen. The leftover defensive check in the catches this up-front. Refs: open-telemetry#2860 (comment) - Update module docs to correct the misleading claim that method-level generics are supported, and document why generic type/const params on methods can't work with the Box<dyn Trait> handles. - Add tests covering the new validation paths (lifetime accepted, generic type rejected, generic const rejected, destructured parameter rejected). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback on PR open-telemetry#2860: the requirements doc described `shared` as "one runtime instance shared across cores", which conflicts with the architecture doc's correct statement that pipeline-scoped shared extensions are still instantiated per pipeline instance (per core). Reconcile both docs around the same model: - Execution model (`local` / `shared`) defines the *type constraints* - Sharing boundary is determined by the *scope* at which the extension is declared. In Phase 1 the only scope is pipeline, so `shared` extensions are per-pipeline-instance (per core), with the `Send + Clone` bounds enabling true cross-core sharing later when group / engine scopes land. Changes: - extension-requirements.md: rewrite the execution-model table to surface type constraints and Phase 1 sharing boundary explicitly, add a Phase 1 callout for `shared`, and reframe the surrounding prose so it no longer reads as "shared = cross-core in Phase 1". - extension-system-architecture.md: append a forward pointer in the thread-per-core requirement row so a table-only reader sees the per-core caveat already documented further down. Refs: open-telemetry#2860 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gouslu
added a commit
to gouslu/otel-arrow
that referenced
this pull request
May 12, 2026
…bounded drain Address review feedback on PR open-telemetry#2860: - runtime_pipeline.rs: refactor the run loop into an inner async block whose Result is captured, so cleanup runs on every exit path ("async finally"). Previously, on any error the loop did `return Err(e)` immediately and dropped already-started extensions without sending Shutdown — extensions owning sockets, files, or background work missed their cleanup signal. Now the outer block always calls broadcast_shutdown() (idempotent on the normal path) and drain_until_deadline().await before propagating the loop's result. Behavior on the normal path is unchanged. Refs: open-telemetry#2860 (comment) - extension_lifecycle.rs: add drain_until_deadline() that waits for remaining active+background extension tasks to finish but never past the broadcast deadline (plus a small slack to absorb the ControlChannel adapter's deferred Shutdown delivery and the task-return / JoinHandle-observed latency). A misbehaving extension that ignores Shutdown can no longer hang the pipeline indefinitely — it is dropped after EXTENSION_SHUTDOWN_GRACE + EXTENSION_SHUTDOWN_DRAIN_SLACK with a warning. broadcast_shutdown now also stashes the deadline so the drain stays consistent with what the wire-level message advertised. Refs: open-telemetry#2860 (comment) - extension_e2e.rs: regression test test_other_extensions_receive_shutdown_when_pipeline_errors — configures one failing extension and one shutdown-recording extension (each bound by its own probe receiver to survive defined-but-unbound pruning). When the failing extension's start() errors, the pipeline aborts but the recording extension must still observe Shutdown before run_forever returns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…bounded drain On error, the run loop now flows through an "async finally" block that broadcasts Shutdown to remaining extensions and bounded-drains them before propagating the error, instead of dropping live extensions mid-run. The drain is capped at EXTENSION_SHUTDOWN_GRACE + a small slack so a misbehaving extension can't hang the pipeline. Adds a regression test covering the error path. Refs: open-telemetry#2860 (comment) Refs: open-telemetry#2860 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bb54448 to
d17b453
Compare
Mirror the existing per-node validation pass for extensions in `validate_pipeline_components`. With `NodeKind::Extension` removed, unknown extension URNs and bad extension configs were no longer caught at startup and only surfaced at runtime when the engine tried to resolve them. Now they fail fast with the same kind of `Unknown ... component` / `Invalid config for ...` message we already give for nodes. Adds a unit test covering the unknown-URN path. Refs: open-telemetry#2860 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The config referenced `urn:otap:extension:sample_shared_key_value_store`, which has no registered ExtensionFactory anywhere in the binary. Until this PR's earlier validation tightening (f9a67ea) the validate-configs CI job missed it because extension URNs weren't walked. With the fix in place, this YAML now fails validate-configs. It also has no test/script/doc consumers — only the validate-configs sweep was loading it. The schema is still demonstrated by the `test_extension_with_config_and_capabilities` unit test in crates/config/src/pipeline.rs, so deleting the orphan loses no documentation value. A runnable demo extension can land in a follow-up alongside a real factory. Refs: open-telemetry#2860 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a "Runtime cost" sub-paragraph under the existing "Extensions start first, shut down last" key design decision in extension-system-architecture.md. Spells out that start() (and any capability method dispatched on that runtime) shares the per-core async runtime with the data path, so blocking I/O or CPU-heavy work in those bodies starves the pipeline on that core. Recommends spawn_blocking, a worker thread, or Rayon for off-runtime work. Same guidance applies to background extensions. Refs: open-telemetry#2860 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Change Summary
Part 4 of the Extension System (P1) series. Wires the previously
landed Capability Registry & Resolver (#2732) into the runtime
pipeline so extensions are actually instantiated, started, and
shut down by the engine, and so consumer nodes can resolve their
capability bindings at build time.
Highlights:
runtime_pipeline.rs: extension lifecycleis invoked before any data-path node is spawned, and
Shutdownis delivered to extensions only after the data path drains
("started first, shut down last"). Active and passive extensions
are handled separately; failures abort startup cleanly.
Box-clone factory pattern, removing the prior asymmetry between
the two trait variants.
crates/engine/src/capability/:NoOpStatelessandNoOpStateful. They exercise every codegen path of the#[capability]proc macro (&self× {sync, async},&mut self× {sync, async}, borrowed/owned returns, etc.).
crates/engine/tests/extension_e2e.rs(26 tests) covering:passive/active/background extensions, lifecycle ordering and
shutdown ordering, fail-fast on extension errors, dual-variant
pruning, one-shot capability enforcement (all accessor
combinations), shared mutable state across consumers via
Arc/Rcfor both local and shared trait variants, async&mut selfinvocation through boxed handles, and activeextensions mutating shared state observed by capability
consumers.
start-first/shut-down-last invariant (it orders lifecycle
calls, not init completion) and a noted future consideration
to add an opt-in readiness probe if/when an extension needs an
init-complete guarantee.
4-segment form
urn:<namespace>:extension:<id>(mirroring thereceiver/processor/exporter convention), with a short form
extension:<id>. The shared parser core lives in a newprivate
crates/config/src/urn.rs;node_urn.rsandextension_urn.rsdelegate to it with disjoint accepted-kindsets so the two URN types cannot be confused. As a consequence,
NodeKind::Extensionand the now-unreachableError::ExtensionInNodesSectionare removed. Misplacementerrors include actionable hints (e.g. "declare under
extensions:instead ofnodes:").in
core-nodesandcontrib-nodes) updated to accept the new&Capabilitiesparameter; existing factories that don't dependon any capability simply ignore it.
What issue does this PR close?
How are these changes tested?
extension_e2e.rsintegration test (26 tests) exercises thewiring end-to-end against synthetic receivers/processors/
exporters/extensions.
urn.rscover the shared parser core and themisplacement-error hints; existing
extension_urnandnode_urntests updated to assert the canonical 4-segment form.in the
nodes:section and node URNs in theextensions:section.
cargo xtask check(structure check +fmt+clippy --workspace --all-targets -- -D warnings+cargo test --workspace) passescleanly. No new clippy warnings.
Are there any user-facing changes?
Yes:
urn:<namespace>:extension:<id>(4-segment) instead of thepre-existing 3-segment
urn:<namespace>:<id>. Short formextension:<id>(expands tourn:otel:extension:<id>) isavailable as a developer convenience. Existing 3-segment
extension URNs in pipeline configs must be updated; the bundled
configs/fake-with-extension.yamlshows the new shape.Extensiontrait,ExtensionWrapper::buildertypestate, theextension_capabilities!macro, and the test capabilitiesNoOpStateless/NoOpStatefulare now reachable forexternal extension authors. The architecture doc captures the
lifecycle contract.
&Capabilitiesas aparameter; existing custom factories will need to accept (and
may ignore) this new argument